Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

  • Jump to comment-1
    bharath.rupireddyforpostgres@gmail.com2022-08-05T10:25:26+00:00
    Hi, I noticed that dir_open_for_write() in walmethods.c uses write() for WAL file initialization (note that this code is used by pg_receivewal and pg_basebackup) as opposed to core using pg_pwritev_with_retry() in XLogFileInitInternal() to avoid partial writes. Do we need to fix this? Thoughts? -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
    • Jump to comment-1
      michael@paquier.xyz2022-08-06T06:41:18+00:00
      On Fri, Aug 05, 2022 at 03:55:26PM +0530, Bharath Rupireddy wrote: > I noticed that dir_open_for_write() in walmethods.c uses write() for > WAL file initialization (note that this code is used by pg_receivewal > and pg_basebackup) as opposed to core using pg_pwritev_with_retry() in > XLogFileInitInternal() to avoid partial writes. Do we need to fix > this? 0d56acfb has moved pg_pwritev_with_retry to be backend-only in fd.c :/ > Thoughts? Makes sense to me for the WAL segment pre-padding initialization, as we still want to point to the beginning of the segment after we are done with the pre-padding, and the code has an extra lseek(). -- Michael
      • Jump to comment-1
        bharath.rupireddyforpostgres@gmail.com2022-08-07T01:12:11+00:00
        On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Aug 05, 2022 at 03:55:26PM +0530, Bharath Rupireddy wrote: > > I noticed that dir_open_for_write() in walmethods.c uses write() for > > WAL file initialization (note that this code is used by pg_receivewal > > and pg_basebackup) as opposed to core using pg_pwritev_with_retry() in > > XLogFileInitInternal() to avoid partial writes. Do we need to fix > > this? > > 0d56acfb has moved pg_pwritev_with_retry to be backend-only in fd.c :/ Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h so that everyone can use it. > > Thoughts? > > Makes sense to me for the WAL segment pre-padding initialization, as > we still want to point to the beginning of the segment after we are > done with the pre-padding, and the code has an extra lseek(). Thanks. I attached the v1 patch, please review it. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
        • Jump to comment-1
          thomas.munro@gmail.com2022-08-07T02:12:56+00:00
          On Sun, Aug 7, 2022 at 1:12 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier <michael@paquier.xyz> wrote: > Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h > so that everyone can use it. > > > > Thoughts? > > > > Makes sense to me for the WAL segment pre-padding initialization, as > > we still want to point to the beginning of the segment after we are > > done with the pre-padding, and the code has an extra lseek(). > > Thanks. I attached the v1 patch, please review it. Hi Bharath, +1 for moving pg_pwritev_with_retry() to file_utils.c, but I think the commit message should say that is happening. Maybe the move should even be in a separate patch (IMHO it's nice to separate mechanical change patches from new logic/work patches). +/* + * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the given + * file of size total_sz in batches of size block_sz. + */ +ssize_t +pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz) Hmm, why not give it a proper name that says it writes zeroes? Size arguments around syscall-like facilities are usually size_t. + memset(zbuffer.data, 0, block_sz); I believe the modern fashion as of a couple of weeks ago is to tell the compiler to zero-initialise. Since it's a union you'd need designated initialiser syntax, something like zbuffer = { .data = {0} }? + iov[i].iov_base = zbuffer.data; + iov[i].iov_len = block_sz; How can it be OK to use caller supplied block_sz, when sizeof(zbuffer.data) is statically determined? What is the point of this argument? - dir_data->lasterrno = errno; + /* If errno isn't set, assume problem is no disk space */ + dir_data->lasterrno = errno ? errno : ENOSPC; This coding pattern is used in places where we blame short writes on lack of disk space without bothering to retry. But the code used in this patch already handles short writes: it always retries, until eventually, if you really are out of disk space, you should get an actual ENOSPC from the operating system. So I don't think the guess-it-must-be-ENOSPC technique applies here.
          • Jump to comment-1
            bharath.rupireddyforpostgres@gmail.com2022-08-07T05:11:49+00:00
            On Sun, Aug 7, 2022 at 7:43 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sun, Aug 7, 2022 at 1:12 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier <michael@paquier.xyz> wrote: > > Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h > > so that everyone can use it. > > > > > > Thoughts? > > > > > > Makes sense to me for the WAL segment pre-padding initialization, as > > > we still want to point to the beginning of the segment after we are > > > done with the pre-padding, and the code has an extra lseek(). > > > > Thanks. I attached the v1 patch, please review it. > > Hi Bharath, > > +1 for moving pg_pwritev_with_retry() to file_utils.c, but I think the > commit message should say that is happening. Maybe the move should > even be in a separate patch (IMHO it's nice to separate mechanical > change patches from new logic/work patches). Agree. I separated out the changes. > +/* > + * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the given > + * file of size total_sz in batches of size block_sz. > + */ > +ssize_t > +pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz) > > Hmm, why not give it a proper name that says it writes zeroes? Done. > Size arguments around syscall-like facilities are usually size_t. > > + memset(zbuffer.data, 0, block_sz); > > I believe the modern fashion as of a couple of weeks ago is to tell > the compiler to zero-initialise. Since it's a union you'd need > designated initialiser syntax, something like zbuffer = { .data = {0} > }? Hm, but we have many places still using memset(). If we were to change these syntaxes, IMO, it must be done separately. > + iov[i].iov_base = zbuffer.data; > + iov[i].iov_len = block_sz; > > How can it be OK to use caller supplied block_sz, when > sizeof(zbuffer.data) is statically determined? What is the point of > this argument? Yes, removed block_sz function parameter. > - dir_data->lasterrno = errno; > + /* If errno isn't set, assume problem is no disk space */ > + dir_data->lasterrno = errno ? errno : ENOSPC; > > This coding pattern is used in places where we blame short writes on > lack of disk space without bothering to retry. But the code used in > this patch already handles short writes: it always retries, until > eventually, if you really are out of disk space, you should get an > actual ENOSPC from the operating system. So I don't think the > guess-it-must-be-ENOSPC technique applies here. Done. Thanks for reviewing. PSA v2 patch-set. Also,I added a CF entry https://commitfest.postgresql.org/39/3803/ to give the patches a chance to get tested. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
            • Jump to comment-1
              michael@paquier.xyz2022-08-07T07:56:00+00:00
              On Sun, Aug 07, 2022 at 10:41:49AM +0530, Bharath Rupireddy wrote: > Agree. I separated out the changes. + +/* + * A convenience wrapper for pwritev() that retries on partial write. If an + * error is returned, it is unspecified how much has been written. + */ +ssize_t +pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) If moving this routine, this could use a more explicit description, especially on errno, for example, that could be consumed by the caller on failure to know what's happening. >> +/* >> + * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the given >> + * file of size total_sz in batches of size block_sz. >> + */ >> +ssize_t >> +pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz) >> >> Hmm, why not give it a proper name that says it writes zeroes? > > Done. FWIW, when it comes to that we have a couple of routines that just use '0' to mean such a thing, aka palloc0(). I find 0002 confusing, as it introduces in fe_utils.c a new wrapper (pg_pwritev_with_retry_and_write_zeros) on what's already a wrapper (pg_pwritev_with_retry) for pwrite(). A second thing is that pg_pwritev_with_retry_and_write_zeros() is designed to work on WAL segments initialization and it uses XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing in its name that tells us so. This makes me question whether file_utils.c is a good location for this second thing. Could a new file be a better location? We have a xlogutils.c in the backend, and a name similar to that in src/common/ would be one possibility. -- Michael
              • Jump to comment-1
                thomas.munro@gmail.com2022-08-07T09:49:19+00:00
                On Sun, Aug 7, 2022 at 7:56 PM Michael Paquier <michael@paquier.xyz> wrote: > FWIW, when it comes to that we have a couple of routines that just use > '0' to mean such a thing, aka palloc0(). I find 0002 confusing, as it > introduces in fe_utils.c a new wrapper > (pg_pwritev_with_retry_and_write_zeros) on what's already a wrapper > (pg_pwritev_with_retry) for pwrite(). What about pg_write_initial_zeros(fd, size)? How it writes zeros (ie using vector I/O, and retrying) seems to be an internal detail, no? "initial" to make it clearer that it's at offset 0, or maybe pg_pwrite_zeros(fd, size, offset). > A second thing is that pg_pwritev_with_retry_and_write_zeros() is > designed to work on WAL segments initialization and it uses > XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing > in its name that tells us so. This makes me question whether > file_utils.c is a good location for this second thing. Could a new > file be a better location? We have a xlogutils.c in the backend, and > a name similar to that in src/common/ would be one possibility. Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or maybe it's OK to use BLCKSZ with a comment to say that it's a bit arbitrary, or maybe it's better to define a new zero buffer of some arbitrary size just in this code if that is too strange. We could experiment with different size buffers to see how it performs, bearing in mind that every time we double it you halve the number of system calls, but also bearing in mind that at some point it's too much for the stack. I can tell you that the way that code works today was not really written with performance in mind (unlike, say, the code reverted from 9.4 that tried to do this with posix_fallocate()), it was just finding an excuse to call pwritev(), to exercise new fallback code being committed for use by later AIO stuff (more patches coming soon). The retry support was added because it seemed plausible that some system out there would start to do short writes as we cranked up the sizes for some implementation reason other than ENOSPC, so we should make a reusable retry routine. I think this should also handle the remainder after processing whole blocks, just for completeness. If I call the code as presented with size 8193, I think this code will only write 8192 bytes. I think if this ever needs to work on O_DIRECT files there would be an alignment constraint on the buffer and size, but I don't think we have to worry about that for now.
                • Jump to comment-1
                  bharath.rupireddyforpostgres@gmail.com2022-08-07T15:52:39+00:00
                  On Sun, Aug 7, 2022 at 3:19 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > A second thing is that pg_pwritev_with_retry_and_write_zeros() is > > designed to work on WAL segments initialization and it uses > > XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing > > in its name that tells us so. This makes me question whether > > file_utils.c is a good location for this second thing. Could a new > > file be a better location? We have a xlogutils.c in the backend, and > > a name similar to that in src/common/ would be one possibility. > > Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or > maybe it's OK to use BLCKSZ with a comment to say that it's a bit > arbitrary, or maybe it's better to define a new zero buffer of some > arbitrary size just in this code if that is too strange. We could > experiment with different size buffers to see how it performs, bearing > in mind that every time we double it you halve the number of system > calls, but also bearing in mind that at some point it's too much for > the stack. I can tell you that the way that code works today was not > really written with performance in mind (unlike, say, the code > reverted from 9.4 that tried to do this with posix_fallocate()), it > was just finding an excuse to call pwritev(), to exercise new fallback > code being committed for use by later AIO stuff (more patches coming > soon). The retry support was added because it seemed plausible that > some system out there would start to do short writes as we cranked up > the sizes for some implementation reason other than ENOSPC, so we > should make a reusable retry routine. Yes, doubling the zerobuffer size to say 2 * XLOG_BLCKSZ or 2 * BLCKSZ reduces the system calls to half (right now, pg_pwritev_with_retry() gets called 64 times per 16MB WAL file, it writes in the batches of 32 blocks per call). Is there a ready-to-use tool or script or specific settings for pgbench (pgbench command line options or GUC settings) that I can play with to measure the performance? > I think this should also handle the remainder after processing whole > blocks, just for completeness. If I call the code as presented with size > 8193, I think this code will only write 8192 bytes. Hm, I will fix it. > I think if this ever needs to work on O_DIRECT files there would be an > alignment constraint on the buffer and size, but I don't think we have > to worry about that for now. We can add a comment about the above limitation, if required. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
                  • Jump to comment-1
                    bharath.rupireddyforpostgres@gmail.com2022-08-08T12:40:23+00:00
                    On Sun, Aug 7, 2022 at 9:22 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Sun, Aug 7, 2022 at 3:19 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > > A second thing is that pg_pwritev_with_retry_and_write_zeros() is > > > designed to work on WAL segments initialization and it uses > > > XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing > > > in its name that tells us so. This makes me question whether > > > file_utils.c is a good location for this second thing. Could a new > > > file be a better location? We have a xlogutils.c in the backend, and > > > a name similar to that in src/common/ would be one possibility. > > > > Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or > > maybe it's OK to use BLCKSZ with a comment to say that it's a bit > > arbitrary, or maybe it's better to define a new zero buffer of some > > arbitrary size just in this code if that is too strange. We could > > experiment with different size buffers to see how it performs, bearing > > in mind that every time we double it you halve the number of system > > calls, but also bearing in mind that at some point it's too much for > > the stack. I can tell you that the way that code works today was not > > really written with performance in mind (unlike, say, the code > > reverted from 9.4 that tried to do this with posix_fallocate()), it > > was just finding an excuse to call pwritev(), to exercise new fallback > > code being committed for use by later AIO stuff (more patches coming > > soon). The retry support was added because it seemed plausible that > > some system out there would start to do short writes as we cranked up > > the sizes for some implementation reason other than ENOSPC, so we > > should make a reusable retry routine. > > Yes, doubling the zerobuffer size to say 2 * XLOG_BLCKSZ or 2 * BLCKSZ > reduces the system calls to half (right now, pg_pwritev_with_retry() > gets called 64 times per 16MB WAL file, it writes in the batches of 32 > blocks per call). > > Is there a ready-to-use tool or script or specific settings for > pgbench (pgbench command line options or GUC settings) that I can play > with to measure the performance? I played with a simple insert use-case [1] that generates ~380 WAL files, with different block sizes. To my surprise, I have not seen any improvement with larger block sizes. I may be doing something wrong here, suggestions on to test and see the benefits are welcome. > > I think this should also handle the remainder after processing whole > > blocks, just for completeness. If I call the code as presented with size > > 8193, I think this code will only write 8192 bytes. > > Hm, I will fix it. Fixed. I'm attaching v5 patch-set. I've addressed review comments received so far and fixed a compiler warning that CF bot complained about. Please review it further. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
                    • Jump to comment-1
                      bharath.rupireddyforpostgres@gmail.com2022-08-09T07:30:35+00:00
                      On Mon, Aug 8, 2022 at 6:10 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > I played with a simple insert use-case [1] that generates ~380 WAL > files, with different block sizes. To my surprise, I have not seen any > improvement with larger block sizes. I may be doing something wrong > here, suggestions on to test and see the benefits are welcome. > > > > I think this should also handle the remainder after processing whole > > > blocks, just for completeness. If I call the code as presented with size > > > 8193, I think this code will only write 8192 bytes. > > > > Hm, I will fix it. > > Fixed. > > I'm attaching v5 patch-set. I've addressed review comments received so > far and fixed a compiler warning that CF bot complained about. > > Please review it further. I tried to vary the zero buffer size to see if there's any huge benefit for the WAL-generating queries. Unfortunately, I didn't see any benefit on my dev system (16 vcore, 512GB SSD, 32GB RAM) . The use case I've tried is at [1] and the results are at [2]. Having said that, the use of pg_pwritev_with_retry() in walmethods.c will definitely reduce number of system calls - on HEAD the dir_open_for_write() makes pad_to_size/XLOG_BLCKSZ i.e. 16MB/8KB = 2,048 write() calls and with patch it makes only 64 pg_pwritev_with_retry() calls with XLOG_BLCKSZ zero buffer size. The proposed patches will provide straight 32x reduction in system calls (for pg_receivewal and pg_basebackup) apart from the safety against partial writes. [1] /* built source code with release flags */ ./configure --with-zlib --enable-depend --prefix=$PWD/inst/ --with-openssl --with-readline --with-perl --with-libxml CFLAGS='-O2' > install.log && make -j 8 install > install.log 2>&1 & \q ./pg_ctl -D data -l logfile stop rm -rf data /* ensured that nothing exists in OS page cache */ free -m sudo su sync; echo 3 > /proc/sys/vm/drop_caches exit free -m ./initdb -D data ./pg_ctl -D data -l logfile start ./psql -d postgres -c 'ALTER SYSTEM SET max_wal_size = "64GB";' ./psql -d postgres -c 'ALTER SYSTEM SET shared_buffers = "8GB";' ./psql -d postgres -c 'ALTER SYSTEM SET work_mem = "16MB";' ./psql -d postgres -c 'ALTER SYSTEM SET checkpoint_timeout = "1d";' ./pg_ctl -D data -l logfile restart ./psql -d postgres -c 'create table foo(bar int);' ./psql -d postgres \timing insert into foo select * from generate_series(1, 100000000); /* this query generates about 385 WAL files, no checkpoint hence no recycle of old WAL files, all new WAL files */ [2] HEAD Time: 84249.535 ms (01:24.250) HEAD with wal_init_zero off Time: 75086.300 ms (01:15.086) #define PWRITEV_BLCKSZ XLOG_BLCKSZ Time: 85254.302 ms (01:25.254) #define PWRITEV_BLCKSZ (4 * XLOG_BLCKSZ) Time: 83542.885 ms (01:23.543) #define PWRITEV_BLCKSZ (16 * XLOG_BLCKSZ) Time: 84035.770 ms (01:24.036) #define PWRITEV_BLCKSZ (64 * XLOG_BLCKSZ) Time: 84749.021 ms (01:24.749) #define PWRITEV_BLCKSZ (256 * XLOG_BLCKSZ) Time: 84273.466 ms (01:24.273) #define PWRITEV_BLCKSZ (512 * XLOG_BLCKSZ) Time: 84233.576 ms (01:24.234) -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/